e.g., extract only person names (NNP, proper nouns) from the Emma.txt corpus and filter out stop words
most_common: the most frequently occurring words
6. wordcloud
Uses FreqDist
Visualization based on word frequency
Content
1. Corpus
import nltk
nltk.download('book', quiet=True)
True
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
raw = nltk.corpus.gutenberg.raw('bryant-stories.txt')
print(raw[:300])
[Stories to Tell to Children by Sara Cone Bryant 1918]
TWO LITTLE RIDDLES IN RHYME
There's a garden that I ken,
Full of little gentlemen;
Little caps of blue they wear,
And green ribbons, very fair.
(Flax.)
From house to house he goes,
A me
2. Tokenizing
Sentence level
sent_tokenize: returns a list of sentences
from nltk.tokenize import sent_tokenize
sent_tokenize(raw[:300])
["[Stories to Tell to Children by Sara Cone Bryant 1918] \r\n\r\n\r\nTWO LITTLE RIDDLES IN RHYME\r\n\r\n\r\n There's a garden that I ken,\r\n Full of little gentlemen;\r\n Little caps of blue they wear,\r\n And green ribbons, very fair.",
'(Flax.)',
'From house to house he goes,\r\n A me']
Word level
word_tokenize: splits text into words with a TreebankWordTokenizer-style tokenizer
from nltk.tokenize import word_tokenize
word_tokenize("this's, a, test! ha.")
Displaying 5 of 865 matches:
Emma by Jane Austen 1816 VOLUME I CHAPTER
Jane Austen 1816 VOLUME I CHAPTER I Emma Woodhouse handsome clever and rich w
f both daughters but particularly of Emma Between _them_ it was more the intim
nd friend very mutually attached and Emma doing just what she liked highly est
by her own The real evils indeed of Emma s situation were the power of having
similar: words used in contexts similar to the given word
text.similar('Emma', 10)
she it he i harriet you her jane him that
5. FreqDist
FreqDist: a class that holds how often each word is used in a document
Structure: a dict-like mapping of {'word': frequency}
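As a minimal sketch of the {'word': frequency} mapping described above, the snippet below builds a FreqDist from a small token list. The token list is hypothetical and only for illustration. FreqDist subclasses collections.Counter, so the fallback import lets the sketch run even where NLTK is not installed.

```python
# FreqDist subclasses collections.Counter; Counter stands in if NLTK is missing.
try:
    from nltk import FreqDist
except ImportError:
    from collections import Counter as FreqDist

# Hypothetical token list, used only for illustration.
tokens = ["emma", "was", "happy", "and", "emma", "was", "rich"]

fd = FreqDist(tokens)          # count how often each token occurs
emma_count = fd["emma"]        # look up a single word's frequency
top_two = fd.most_common(2)    # the two most frequent (word, count) pairs
total = sum(fd.values())       # total number of tokens counted

print(emma_count, top_two, total)
```

Indexing by word and calling most_common are the two operations these notes rely on later, which is why the sketch stops there.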
Usage 1)
Extract it with the vocab method of the Text class
fd = text.vocab()
type(fd)
nltk.probability.FreqDist
Usage 2)
Create a FreqDist object from words selected from the corpus
e.g., extract only person names (NNP, proper nouns) from the Emma.txt corpus and filter out stop words
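The filtering step can be sketched on a small scale without downloading a corpus or tagger. The (word, tag) pairs below are hypothetical and only mimic the shape that nltk.pos_tag returns, and the stop-word set is likewise an assumption for illustration, not the list used in these notes.

```python
# FreqDist subclasses collections.Counter; Counter stands in if NLTK is missing.
try:
    from nltk import FreqDist
except ImportError:
    from collections import Counter as FreqDist

# Hypothetical (word, tag) pairs in the shape nltk.pos_tag returns.
tagged = [
    ("Emma", "NNP"), ("Woodhouse", "NNP"), ("was", "VBD"), ("rich", "JJ"),
    ("Mr.", "NNP"), ("Knightley", "NNP"), ("liked", "VBD"), ("Emma", "NNP"),
]

# Hypothetical stop words: tokens tagged NNP that are titles, not names.
stop_words = {"Mr.", "Mrs.", "Miss"}

# Keep only proper nouns (NNP) that are not in the stop-word set.
names = [word for word, tag in tagged if tag == "NNP" and word not in stop_words]

# Build the frequency distribution from the filtered words.
fd_names = FreqDist(names)
print(fd_names.most_common(3))
```

With the real corpus, the tagged pairs would presumably come from tokenizing and tagging the Emma text first; the filter and the FreqDist construction stay the same.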